1 Background

1.1 Algoritma

The following coursebook is produced by the team at Algoritma for its Data Science Academy workshops. No part of this coursebook may be reproduced in any form without permission in writing from the authors.

Algoritma is a data science education center based in Jakarta. We organize workshops and training programs to help working professionals and students gain mastery in various data science sub-fields: data visualization, machine learning, data modeling, statistical inference etc. Visit our website for all upcoming workshops.

1.2 Libraries and Setup

We’ll set up caching for this notebook, given how computationally expensive some of the code we write can get (uncomment the first line below to enable it).

#knitr::opts_chunk$set(cache=TRUE)
options(scipen = 9999)
rm(list=ls())

You will need to use install.packages() to install any packages that are not already downloaded onto your machine. You then load the package into your workspace using the library() function:

library(ggplot2)
library(ggpubr)
## Loading required package: magrittr
library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
library(reshape2)
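If you are unsure which of these packages are already on your machine, a small check before calling install.packages() can save some time. The sketch below is one possible way to do it: find_missing() is a hypothetical helper name (not part of any package), and the package list simply mirrors the libraries used in this coursebook.

```r
# Hypothetical helper: return the packages in `pkgs` that are not installed yet
find_missing <- function(pkgs) {
  setdiff(pkgs, rownames(installed.packages()))
}

needed <- c("ggplot2", "ggpubr", "plotly", "reshape2", "dplyr")
to_install <- find_missing(needed)
print(to_install)

# Install only what is missing (left commented out so nothing is downloaded by accident):
# if (length(to_install) > 0) install.packages(to_install)
```

The install call is left as a comment on purpose; uncomment it once you are happy with the list that is printed.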

2 Interactive Visualization

As data grow in complexity and size, the designer often faces the difficult task of balancing overarching storytelling with specificity in their narrative. The designer must also strike a fine balance between coverage and detail under the all-too-real constraints of static graphs and plots.

Interactive visualization is a means of overcoming these constraints, and as we’ll see later, quite a successful one at that. Quoting Rebecca Barter, the author of superheat: “Interactivity allows the viewer to engage with your data in ways impossible by static graphs. With an interactive plot, viewers can zoom into areas they care about, highlight data points that are relevant to them and hide the information that isn’t.”

More than just interactive visualization, we’ll also learn in this 3-day workshop how to make full-fledged interactive documents, interactive dashboards, and as a bonus, how to create multi-paged PDF documents with the ideal layout of our plots.

I’ll start by introducing plotly.

3 Plotly

Plotly is an interactive, browser-based graphing library that helps data analysts create interactive, high-quality graphs in one of the many supported languages.

Building on what we’ve learned in our last workshop, we’ll learn how to add some nice enhancements and interactivity to our plots using plotly. This works entirely locally and through the HTML widgets framework, allowing you to create interactive plots directly within RStudio.

We’ll read our data in and perform the (hopefully by now) standard preprocessing procedure:

vids <- read.csv("youtubetrends.csv")
vids$likesratio <- vids$likes/vids$views
vids$dislikesratio <- vids$dislikes/vids$views

Recall that our videos can take one of 16 possible video categories. We’ve primarily been working with the News and Politics category in our last workshop, so for a change of scenery we’ll be using the videos in the Comedy category for most of our examples.

table(vids$category_id)
## 
##     Autos and Vehicles                 Comedy              Education 
##                     41                    273                    107 
##          Entertainment     Film and Animation                 Gaming 
##                    736                    152                     30 
##        Howto and Style                  Music      News and Politics 
##                    285                    391                    271 
## Nonprofit and Activism       People and Blogs       Pets and Animals 
##                      8                    228                     66 
## Science and Technology                  Shows                 Sports 
##                    175                      1                    188 
##      Travel and Events 
##                     34

Also recall how we created our custom theme together in the last workshop using theme(). Because you’re studying at Algoritma, we’ll save our theme as theme_algoritma:

theme_algoritma <- theme(legend.key = element_rect(fill="black"),
           legend.background = element_rect(color="white", fill="#263238"),
           plot.subtitle = element_text(size=6, color="white"),
           panel.background = element_rect(fill="#dddddd"),
           panel.border = element_rect(fill=NA),
           panel.grid.minor.x = element_blank(),
           panel.grid.major.x = element_blank(),
           panel.grid.major.y = element_line(color="darkgrey", linetype=2),
           panel.grid.minor.y = element_blank(),
           plot.background = element_rect(fill="#263238"),
           text = element_text(color="white"),
           axis.text = element_text(color="white")
           )

Try and spend a couple of minutes on the code above and fully understand what each line does. This should not be too foreign to you by now! We’ll apply this theme a lot in subsequent ggplot graphics, so feel free to revisit this chunk and make any aesthetic adjustments to your liking.

In the past, we’ve relied on R’s base functionality for data preparation. I now want to show you a technique that may greatly increase your productivity when working with R: the dplyr package. It is developed as “a grammar of data manipulation”, and works by providing a consistent set of “verbs” that help you solve the most common data manipulation challenges:
- mutate() adds new variables

vids <- mutate(vids, likeability = likes/dislikes)

- select() keeps only the variables we mention

channels <- select(vids, c(channel_title, category_id))

- filter() returns only the rows that meet the given conditions

filter(vids, views>=25000000)
##   trending_date                                                   title
## 1    2017-11-14             Ed Sheeran - Perfect (Official Music Video)
## 2    2017-11-30 Marvel Studios' Avengers: Infinity War Official Trailer
##          channel_title   category_id        publish_time    views   likes
## 1           Ed Sheeran         Music 2017-11-09 06:04:14 33523622 1634124
## 2 Marvel Entertainment Entertainment 2017-11-29 08:26:24 37736281 1735895
##   dislikes comment_count comments_disabled ratings_disabled
## 1    21082         85067             FALSE            FALSE
## 2    21969        241237             FALSE            FALSE
##   video_error_or_removed publish_hour publish_when publish_wday
## 1                  FALSE            6  12am to 8am     Thursday
## 2                  FALSE            8   8am to 3pm    Wednesday
##   timetotrend likesratio dislikesratio likeability
## 1           5 0.04874545  0.0006288700    77.51276
## 2           1 0.04600069  0.0005821718    79.01566
- summarise() returns summary statistics (min, length, mean etc.)

Each of these verbs also works with group_by(), which allows us to perform any operation “by group”. I’ve attached a full copy of the dplyr cheatsheet in your directory.
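Since summarise() is the only verb in the list without an example, here is a minimal sketch of it, first on its own and then combined with group_by(). The toy data frame is made up for illustration and is not taken from youtubetrends.csv.

```r
library(dplyr)

# Toy data (made up for illustration)
toy <- data.frame(
  category = c("Comedy", "Comedy", "Music"),
  views    = c(100, 300, 500),
  likes    = c(10, 60, 50)
)

# Without grouping: one summary row for the whole data frame
overall <- summarise(toy, n_videos = n(), mean_views = mean(views))

# With group_by(): one summary row per category
by_category <- summarise(group_by(toy, category),
                         n_videos = n(), mean_views = mean(views))

print(overall)
print(by_category)
```

The grouped version returns one row per category, which is the shape we usually want when preparing data for a bar plot.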

A common operation with dplyr is to use group_by and summarise to get a new summary dataframe. This combines nicely with any additional verbs we add to it, through chaining (%>%). Sounds a little abstract, so let’s dive into an example:

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
v.favor <- vids %>% 
  group_by(category_id) %>%
  summarise(likeratio = mean(likes/views), 
            dlikeratio = mean(dislikes/views)
            ) %>%
  mutate(favor = likeratio/dlikeratio)

v.favor
## # A tibble: 16 x 4
##    category_id            likeratio dlikeratio favor
##    <fct>                      <dbl>      <dbl> <dbl>
##  1 Autos and Vehicles        0.0176   0.00103  17.2 
##  2 Comedy                    0.0522   0.00138  37.7 
##  3 Education                 0.0437   0.00143  30.5 
##  4 Entertainment             0.0341   0.00175  19.5 
##  5 Film and Animation        0.0338   0.00152  22.3 
##  6 Gaming                    0.0480   0.00208  23.1 
##  7 Howto and Style           0.0524   0.00147  35.7 
##  8 Music                     0.0724   0.00171  42.3 
##  9 News and Politics         0.0146   0.00394   3.71
## 10 Nonprofit and Activism    0.0245   0.000729 33.6 
## 11 People and Blogs          0.0520   0.00271  19.2 
## 12 Pets and Animals          0.0453   0.000776 58.4 
## 13 Science and Technology    0.0383   0.00141  27.1 
## 14 Shows                     0.0322   0.00163  19.7 
## 15 Sports                    0.0173   0.00120  14.4 
## 16 Travel and Events         0.0285   0.00193  14.7

Now using the v.favor dataframe we created and the theme_algoritma we wrote in our last workshop, let’s build a ggplot:

colp <- ggplot(v.favor, aes(x=category_id, y=favor))+
  geom_col(fill="dodgerblue4")+
  coord_flip()+
  labs(title="Favorability Index by Video Category, 2018")+
  theme_algoritma
colp

A simple but pleasant-looking bar plot. Adding interactivity using plotly is as simple as wrapping our ggplot object in the ggplotly() function:

ggplotly(colp)
## We recommend that you use the dev version of ggplot2 with `ggplotly()`
## Install it with: `devtools::install_github('hadley/ggplot2')`

If you hover your mouse over a bar, notice the tooltip that shows you the value of our “favorability” index for that category.
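The tooltip content can also be customized. As a rough sketch (using a three-row excerpt of the favor values shown earlier rather than the full v.favor), we can map a text aesthetic in ggplot and ask ggplotly() to display only that through its tooltip argument. The label format is an arbitrary choice made for this example.

```r
library(ggplot2)
library(plotly)

# Three rows copied from the v.favor output, for illustration only
toy <- data.frame(
  category_id = c("Comedy", "Music", "Sports"),
  favor       = c(37.7, 42.3, 14.4)
)

# `text` is not drawn by geom_col(); ggplotly() picks it up for the tooltip
p <- ggplot(toy, aes(x = category_id, y = favor,
                     text = paste0(category_id, ": ", favor))) +
  geom_col(fill = "dodgerblue4")

# tooltip = "text" shows only our custom label instead of every mapped aesthetic
widget <- ggplotly(p, tooltip = "text")

# Print `widget` in RStudio to view it in the Viewer pane
```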

Updated (29 June): If ggplotly glitches out or returns an incorrect-looking plot, try installing ggplot2 from Hadley’s GitHub repo using devtools::install_github('hadley/ggplot2') and restart your R session.
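Going back to the bar plot for a moment: the bars in colp follow the alphabetical order of category_id. A common tweak, sketched below on a three-row excerpt of the favor values, is to sort the bars by value using base R’s reorder(). The object name colp_sorted is made up for this example.

```r
library(ggplot2)

# Three rows copied from the v.favor output, for illustration only
toy <- data.frame(
  category_id = c("Comedy", "Music", "Sports"),
  favor       = c(37.7, 42.3, 14.4)
)

# reorder() returns a factor whose levels follow ascending favor
sorted_levels <- levels(reorder(toy$category_id, toy$favor))
print(sorted_levels)

colp_sorted <- ggplot(toy, aes(x = reorder(category_id, favor), y = favor)) +
  geom_col(fill = "dodgerblue4") +
  coord_flip() +
  labs(x = "category_id")
```

Because of coord_flip(), the category with the highest favor ends up at the top of the plot.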

Let’s see another example of using the dplyr grammar. Suppose we would like to create a summary table that counts the number of appearances each “Comedy” channel has made in that period of trending videos. In base R, we could have written the following:

comedy <- vids[vids$category_id == "Comedy", ]
comedy <- aggregate(trending_date ~ channel_title, comedy, length)
comedy <- comedy[order(comedy$trending_date, decreasing=T), ]
names(comedy) <- c("channel_title", "count")
head(comedy)
##                             channel_title count
## 91 The Tonight Show Starring Jimmy Fallon    30
## 60            Late Night with Seth Meyers    21
## 25                           CollegeHumor    15
## 44                         IISuperwomanII    12
## 80                              RM Videos    10
## 34                   ExplosmEntertainment     9

Quiz 1: Using dplyr

Could we have done this more easily with dplyr? Refer back to the earlier code chunk and the cheatsheet in your folder to see if you can rewrite the code in dplyr.

From this point on, I’ll leave the creative decision up to you: write R whichever way you prefer! For the most part, to keep the course materials relatively beginner-friendly, I’ll use the base R method, but where it greatly simplifies things I’ll use dplyr in future courses and will expect you to understand it.

Now let’s create a second ggplot object, which I’ll name hexp:

hexp <- ggplot(vids[vids$category_id == "Comedy",], aes(x=likesratio, y=dislikesratio))+
  geom_point(aes(size=views), alpha=0.5, show.legend = F)+
  labs(title="Likes vs Dislikes in Trending Comedy Videos", subtitle="Visualizing likes vs dislikes in the Algoritma theme, source: YouTube")+
  theme_algoritma
hexp

Wrapping hexp in our ggplotly() function yields an interactive HTML widget:

ggplotly(hexp)

Plotly works with time series data as well. To illustrate this, I’ll use the economics data that ships with ggplot2. economics_long is a US economic time series and I’ll use the first three columns of it:

el <- as.data.frame(economics_long[,1:3])

Next, we create a facet_grid ggplot object with a separate y-scale for each panel of the grid:

econp <- ggplot(el, aes(date, value, group=variable)) + 
  geom_line()+
  facet_grid(variable ~ ., scales = "free_y")+
  labs(title="US Economic time series")+
  theme_algoritma
econp

We then wrap it in ggplotly() to add interactivity to the plot:

ggplotly(econp)

Because it is a plotly object, you can also use supporting plotly functions such as rangeslider() to add a range slider to the x-axis.

rangeslider(ggplotly(econp))

Now let’s take our plotly experimentation further. First, we’ll create a long format data frame containing videos that have comments enabled:

vids.m <- vids[vids$comments_disabled == FALSE, c(4,7,8,9)]
# columns 4, 7, 8, 9 are category_id, likes, dislikes, comment_count
vids.m <- melt(vids.m)
## Using category_id as id variables
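If the effect of melt() is not obvious, the toy example below may help. It is only a sketch on made-up numbers: a wide data frame with one row per category is reshaped into a long one with one row per category and measure, which is the shape ggplot needs to map the measure name to fill.

```r
library(reshape2)

# Wide format: one column per measure (made-up numbers)
wide <- data.frame(
  category_id = c("Comedy", "Music"),
  likes       = c(10, 20),
  dislikes    = c(1, 2)
)

# Long format: the measure names go into `variable`, the numbers into `value`
long <- melt(wide, id.vars = "category_id")
print(long)
```

Here id.vars is given explicitly; when it is omitted, as in the chunk above, melt() picks the non-numeric columns as id variables and reports its choice in a message.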

We then create our ggplot and wrap it in ggplotly() as we’ve been doing above:

cplot <- ggplot(vids.m, aes(x=category_id, y=value))+
  # position can also be stack
  geom_col(position="dodge", aes(fill=variable))+
  coord_flip()
ggplotly(cplot)

Quiz 2: Hands-on Plotly

Observe the many different functionalities of plotly by playing around with the icon bar in the widget. Try and do each of the following at least once:
- Switch from “Show closest data on hover” to “Compare data on hover”
- Toggle Spike Lines
- Click on Legend items to toggle visibility

As a bonus exercise, try and create your own unique plotly starting from the raw data (vids). You are free to use any subsetting and pick any plot type - but the end result has to be a plotly object created using the ggplotly() function. When you’re done, we’ll move on to the next chapter!

4 Publication and Layout Options

We’re now going to create a multi-page PDF containing all the plots we’ve created so far. To give our publication a consistent style, let’s apply our theme_algoritma to the last plot we created:

cplot <- cplot + theme_algoritma
cplot

Through the ggpubr package, we’ll use ggarrange() to put the 4 plots we created in earlier steps together into a list. Because we specify nrow=2, the result is a list of 2 objects, each containing 2 rows (one for each plot):

publicat <- ggarrange(hexp, econp, cplot, colp, nrow=2)

Let’s take a look at the first item on our publicat list:

publicat[[1]]

As well as the second:

publicat[[2]]

Once we’re happy with the result, we can use ggexport() and specify a file name. This will export the list as a multi-page PDF:

ggexport(publicat, filename="publication.pdf")
## file saved to publication.pdf
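ggexport() is not the only route to a multi-page PDF. As a rough alternative sketch, base R’s pdf() device starts a new page for every plot printed while the device is open. The built-in mtcars dataset is used here as a stand-in for our plots, and the file name is arbitrary.

```r
library(ggplot2)

# Two stand-in plots built from a dataset that ships with R
p1 <- ggplot(mtcars, aes(wt, mpg)) + geom_point()
p2 <- ggplot(mtcars, aes(factor(cyl))) + geom_bar()

out_file <- file.path(tempdir(), "two_pages.pdf")  # arbitrary file name

pdf(out_file, width = 8, height = 6)  # page size in inches
print(p1)                             # page 1
print(p2)                             # page 2
invisible(dev.off())                  # close the device to finish writing the file
```

Unlike ggexport(), this approach does not arrange several plots on one page; each print() call simply becomes its own page.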

To visualize interactively, just print publicat from your console or document.

Similar to ggarrange(), plotly allows us to put different plots together into one plotly object using the subplot() function. In this plot, there are 4 subplots, and interacting with any one of them will cause the other subplots to react accordingly to your input:

subplot(
  cplot,
  hexp, 
  colp,
  econp,
  nrows=4)

To see another example, I’m going to go ahead and create 4 ggplots:

hexp <- ggplot(vids[vids$category_id == "Comedy",], aes(x=likesratio, y=dislikesratio))+
  geom_point(aes(size=views), alpha=0.5, show.legend = F)+
  labs(title="Likes vs Dislikes in Trending Comedy Videos", subtitle="Visualizing likes vs dislikes in the Algoritma theme, source: YouTube")+
  theme_algoritma
hexp

hexp2 <- ggplot(vids[vids$category_id == "Comedy",], aes(x=likesratio, y=dislikesratio))+
  geom_hex(alpha=0.6, show.legend = F)+
  labs(title="Likes vs Dislikes in Trending Comedy Videos", subtitle="Visualizing likes vs dislikes in the Algoritma theme, source: YouTube")+
  theme_algoritma
hexp2

hexp3 <- ggplot(vids[vids$category_id == "Comedy",], aes(x=likesratio, y=dislikesratio))+
  geom_line(col="black", show.legend = F)+
  labs(title="Likes vs Dislikes in Trending Comedy Videos", subtitle="Visualizing likes vs dislikes in the Algoritma theme, source: YouTube")+
  theme_algoritma
hexp3

hexp4 <- ggplot(vids[vids$category_id == "Comedy",], aes(x=likesratio, y=dislikesratio))+
  geom_bin2d(show.legend=F)+
  labs(title="Likes vs Dislikes in Trending Comedy Videos", subtitle="Visualizing likes vs dislikes in the Algoritma theme, source: YouTube")+
  theme_algoritma
hexp4

We then use subplot() to arrange them together into one plotly widget with the specified widths:

subplot(
hexp, hexp2, hexp3, hexp4,
  nrows=2, shareX=T, shareY=T, widths=c(0.65, 0.35))

Note that as we use the interactive selection tools or zoom in on any one of the plots, the other plots refresh accordingly - a pretty neat feature considering how simple it is to set up!

Note that the common title automatically takes the last plot’s title, so in this case the common (shared) title is inherited from hexp4. As of its current development cycle, plotly does not support titles for subplots or any similar functionality yet, so adding a subplot title or even a shared title is a bit of a hack (using annotate()) and beyond the scope of this coursebook. When this changes in a future release, I will update this coursebook to include examples.

5 Flex Dashboard

Flex Dashboard is an R package that lets you “easily create flexible, attractive, interactive dashboards with R”. Authoring and customization of dashboards is done using R Markdown with the flexdashboard::flex_dashboard output format. To get started, install flexdashboard using the standard installation process you should be familiar with by now:

install.packages("flexdashboard")

When that is done, create a new R Markdown document from within RStudio, choose “From Template”, and then select Flex Dashboard.

The template code that was generated for you comes with some default values: for example, it uses a column orientation and sets your vertical layout to fill.

If you would like your plots to change in height so that they fill the web page vertically, keep the vertical_layout: fill (default) setting. If you want the charts to keep their original height instead, the page needs to scroll in order to accommodate all your plots; you can enable this by setting vertical_layout to scroll.
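For reference, these settings live in the YAML header at the top of the Rmd file. The fragment below is a sketch of what a header with scrolling enabled might look like; the title is a placeholder, and the generated template may contain additional fields.

```yaml
---
title: "Trending Videos Dashboard"   # placeholder title
output:
  flexdashboard::flex_dashboard:
    orientation: columns
    vertical_layout: scroll          # use "fill" to stretch charts to the page height
---
```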

Within each code chunk of the Rmd template that was generated for you, you will commonly enter:
- R graphical output (plot(), ggplot())
- Interactive JavaScript data visualization based on htmlwidgets (plotly)
- Tabular data (table())
- Common summary data, text, values etc.

6 Summary

Congratulations on getting started with making interactive plots using plotly and flexdashboard! In the remaining sessions of this workshop, we’ll look at creating an interactive document that allows the end user to interact with our creation, and we’ll publish our project onto the web in the learn-by-building module.

I hope you’re starting to feel more accomplished than in the earlier days, when we were all learning the ropes in our first few sessions. As always, the secret to fluency is practice!

Happy coding!

Samuel

7 Reference Answer

comedy2 <- vids %>% 
  filter(category_id == "Comedy") %>%
  group_by(channel_title) %>%
  summarise(count = n()) %>%
  arrange(desc(count))
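The answer above can be shortened further. As a sketch on made-up data (the channel names A, B and C are placeholders, and comedy3 is just an illustrative object name), dplyr’s count() combines the group_by(), summarise() and arrange() steps into one call:

```r
library(dplyr)

# Made-up stand-in for vids
toy <- data.frame(
  channel_title = c("A", "B", "A", "C", "A", "B"),
  category_id   = c("Comedy", "Comedy", "Comedy", "Music", "Comedy", "Comedy")
)

comedy3 <- toy %>%
  filter(category_id == "Comedy") %>%
  # sort = TRUE orders by descending frequency; name sets the count column's name
  count(channel_title, sort = TRUE, name = "count")

print(comedy3)
```

On the real data, replacing toy with vids should give the same table as comedy2.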